Data lineage is defined as a data life cycle that includes the data's origins and where it moves over time.〔http://www.techopedia.com/definition/28040/data-lineage〕 It describes what happens to data as it goes through diverse processes. It helps provide visibility into the analytics pipeline and simplifies tracing errors back to their sources. It also enables replaying specific portions or inputs of the dataflow for step-wise debugging or for regenerating lost output. Database systems have long used such information, called data provenance, to address similar validation and debugging challenges.〔De, Soumyarupa. (2012). Newt: an architecture for lineage based replay and debugging in DISC systems. UC San Diego: b7355202. Retrieved from: https://escholarship.org/uc/item/3170p7zn〕

Data lineage provides a visual representation with which to discover the data flow and movement from source to destination, via the various changes and hops it makes on its way through the enterprise environment: how the data hops between data points, how it gets transformed along the way, how its representation and parameters change, and how it splits or converges after each hop. A simple representation of data lineage uses dots and lines, where each dot represents a data container holding one or more data points, and the lines connecting them represent the transformations the data undergoes between containers.

How data lineage is represented depends broadly on the scope of the ''metadata management'' and on the reference point of interest. Backward data lineage shows the sources of the data and the intermediate hops leading to the reference point; forward data lineage shows the data points and intermediate flows leading from it to the final destination(s). These views can be combined into end-to-end lineage, which provides a complete audit trail of the data point of interest from its source(s) to its final destination(s). As the number of data points and hops grows, such a representation quickly becomes incomprehensible, so a key feature of any data lineage view is the ability to simplify it by temporarily ''masking'' unwanted peripheral data points. Tools with a masking feature keep the view scalable and support analysis with a good user experience for technical and business users alike.

The scope of the data lineage determines the volume of metadata required to represent it. Usually, data governance and data management determine this scope based on their regulations, the ''enterprise data management strategy'', ''data impact'', ''reporting attributes'', and the ''critical data elements'' of the organization. Data lineage provides the audit trail of the data points at the lowest granular level, but the lineage may be presented at various zoom levels to simplify the vast information, similar to ''analytic web maps''. At the highest level, data lineage shows which systems the data interacts with before it reaches its destination. As the granularity increases, the view goes down to the level of an individual data point, where it can provide that data point's details: its historical behavior, attribute properties, trends, and the ''data quality'' of the data that has passed through it.
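The dots-and-lines model described above maps naturally onto a directed graph. The following minimal Python sketch illustrates that idea; all class and method names are hypothetical, not taken from any particular lineage tool. It stores containers as nodes and hops as labeled edges, and derives the backward, forward, end-to-end, and masked views from the same structure.

<syntaxhighlight lang="python">
from collections import defaultdict

class LineageGraph:
    """Dots-and-lines model: containers are nodes, transformations are edges."""

    def __init__(self):
        self.downstream = defaultdict(set)   # container -> containers it feeds
        self.upstream = defaultdict(set)     # container -> containers feeding it
        self.transforms = {}                 # (source, target) -> transformation

    def add_hop(self, source, target, transformation):
        """Record one hop: data moves from source to target via a transformation."""
        self.downstream[source].add(target)
        self.upstream[target].add(source)
        self.transforms[(source, target)] = transformation

    def backward_lineage(self, reference_point):
        """Sources and intermediate hops leading to the reference point."""
        return self._walk(reference_point, self.upstream)

    def forward_lineage(self, reference_point):
        """Destinations and intermediate flows reached from the reference point."""
        return self._walk(reference_point, self.downstream)

    def end_to_end(self, reference_point):
        """Complete audit trail: backward and forward views combined."""
        return self.backward_lineage(reference_point) | self.forward_lineage(reference_point)

    def masked_view(self, reference_point, masked):
        """Simplified view with unwanted peripheral containers temporarily hidden."""
        return self.end_to_end(reference_point) - set(masked)

    def _walk(self, start, edges):
        seen, stack = set(), [start]
        while stack:
            for nxt in edges[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

# Hypothetical enterprise flow: CRM database -> staging -> warehouse -> report.
g = LineageGraph()
g.add_hop("crm_db", "staging", "extract")
g.add_hop("staging", "warehouse", "cleanse and join")
g.add_hop("warehouse", "sales_report", "aggregate")
print(g.backward_lineage("sales_report"))       # {'warehouse', 'staging', 'crm_db'}
print(g.masked_view("warehouse", ["staging"]))  # peripheral hop temporarily hidden
</syntaxhighlight>

Because both traversals are derived from the same edge set, a richer implementation could annotate each edge with the metadata enrichments discussed below.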
''Data governance'' plays a key role in metadata management, providing the guidelines, strategies, policies, and implementation. ''Data quality'' and ''master data management'' help enrich the data lineage with more business value. Although the final representation of data lineage is provided in a single interface, the way the metadata is harvested and exposed to the data lineage user interface (UI) can differ entirely. Thus, data lineage can be broadly divided into three categories based on the way metadata is harvested: data lineage involving ''software packages for structured data'', ''programming languages'', and ''big data''.

Data lineage information includes, at a minimum, the technical metadata describing the data points and their various transformations. Beyond that, it may be enriched with the corresponding ''data quality'' results, reference data values, data models, business vocabulary, and the people, programs, and systems linked to the data points and transformations. The masking feature in data lineage visualization lets tools incorporate all the enrichments that matter for the specific use case. Metadata normalization may also be performed to represent disparate systems in one common view.

Data provenance documents the inputs, entities, systems, and processes that influence data of interest, in effect providing a historical record of the data and its origins. The generated evidence supports essential forensic activities such as data-dependency analysis, error/compromise detection and recovery, and auditing and compliance analysis. "Lineage is a simple type of why provenance."
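To make the idea of such a historical record concrete, here is a minimal, hedged sketch in the same Python style. The field names are assumptions chosen to mirror the inputs/entities/systems/processes vocabulary of the paragraph above; they do not follow any particular standard such as W3C PROV.

<syntaxhighlight lang="python">
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    """Historical record of one derivation: what influenced the data of interest."""
    entity: str            # the data of interest, e.g. a dataset or report
    inputs: list[str]      # upstream entities it was derived from
    process: str           # transformation or program that produced it
    system: str            # system on which the process ran
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

# One record per derivation step; a chain of such records is the evidence
# used for data-dependency analysis, error detection and recovery, and audits.
record = ProvenanceRecord(
    entity="quarterly_revenue.csv",
    inputs=["orders.parquet", "fx_rates.json"],
    process="aggregate_revenue v2.3",
    system="etl-cluster-01",
)
</syntaxhighlight>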
==Case for Data Lineage==

The world of big data is changing dramatically. According to one oft-cited statistic, 90% of the world's data was created in the last two years alone.〔http://newstex.com/2014/07/12/thedataexplosionin2014minutebyminuteinfographic/〕 This explosion of data has resulted in an ever-growing number of systems and ever more automation, at all levels and in organizations of all sizes. Today, distributed systems like Google MapReduce,〔Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107–113, January 2008.〕 Microsoft Dryad,〔Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, EuroSys '07, pages 59–72, New York, NY, USA, 2007. ACM.〕 Apache Hadoop〔Apache Hadoop. http://hadoop.apache.org.〕 (an open-source project) and Google Pregel〔Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 international conference on Management of data, SIGMOD '10, pages 135–146, New York, NY, USA, 2010. ACM.〕 provide such platforms for businesses and users. Even with these systems, however, big data analytics can take hours, days, or weeks to run, simply because of the data volumes involved. For example, a ratings prediction algorithm for the Netflix Prize challenge took nearly 20 hours to execute on 50 cores, and a large-scale image processing task to estimate geographic information took 3 days to complete using 400 cores.〔Shimin Chen and Steven W. Schlosser. Map-reduce meets wider varieties of applications. Technical report, Intel Research, 2008.〕 "The Large Synoptic Survey Telescope is expected to generate terabytes of data every night and eventually store more than 50 petabytes, while in the bioinformatics sector, the largest genome sequencing houses in the world now store petabytes of data apiece."〔The data deluge in genomics. https://www-304.ibm.com/connections/blogs/ibmhealthcare/entry/data overload in genomics3?lang=de, 2010.〕

Because of the enormous size of big data, there can be features in the data that the machine learning algorithm does not take into account, possibly even outliers. In such cases it is very difficult for a data scientist to trace an unknown or unanticipated result.
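As a closing illustration of why captured lineage helps here, the hedged sketch below logs per-record lineage during a toy two-stage dataflow and then uses it to trace a suspicious output back to the raw inputs that produced it, which is the information a replay-based debugger such as Newt (cited above) needs. The stage and function names are illustrative only, not any real system's API.

<syntaxhighlight lang="python">
def run_stage(stage_name, fn, records, lineage):
    """Apply fn to each record, logging output -> input identifiers per stage."""
    outputs = []
    for rec_id, value in records:
        out_id = f"{stage_name}:{rec_id}"
        lineage[out_id] = [rec_id]      # fine-grained lineage capture
        outputs.append((out_id, fn(value)))
    return outputs

def trace_back(out_id, lineage):
    """Backward trace: recover the raw source records behind one output."""
    frontier, sources = [out_id], []
    while frontier:
        rec_id = frontier.pop()
        parents = lineage.get(rec_id)
        if parents:
            frontier.extend(parents)
        else:
            sources.append(rec_id)      # no parents recorded: a raw input
    return sources

lineage = {}
raw = [("in:0", 3), ("in:1", -7), ("in:2", 5)]                   # toy input records
cleaned = run_stage("clean", abs, raw, lineage)                  # stage 1
scaled = run_stage("scale", lambda v: v * 10, cleaned, lineage)  # stage 2
# An unanticipated output can now be replayed from just its own inputs:
print(trace_back(scaled[1][0], lineage))                         # ['in:1']
</syntaxhighlight>

Real DISC systems record many-to-one dependencies (a reduce output, for example, depends on many map outputs), but the backward walk is the same idea.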